Learnability-Based Syntactic Annotation Design
نویسندگان
چکیده
There is often more than one way to represent syntactic structures, even within a given formalism. Selecting one representation over another may affect parsing performance. Therefore, selecting between alternative syntactic representations (henceforth, syntactic selection) is an essential step in designing an annotation scheme. We present a methodology for syntactic selection and apply it to six central dependency structures. Our methodology compares pairs of annotation schemes that differ in the annotation of a single structure. It selects the more learnable scheme, namely the one that can be better learned using statistical parsers. We find that in three of the structures, one annotation is unequivocally better than the alternatives. Our results are consistent over various settings involving five parsers and two definitions of learnability. Furthermore, we show that the learnability gains incurred by our selections are both considerable (error reductions of up to 19.8%) and additive. The contribution of this work is in demonstrating that syntactic selection has a substantial and predictable effect on parsing performance, and showing that this effect can be effectively used in designing syntactic annotation schemes.
منابع مشابه
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
متن کاملDownstream use of syntactic analysis: does representation matter?
Research in syntactic parsing is largely driven by progress in intrinsic evaluation and there have been impressive developments in recent years in terms of evaluation measures, such as F-score or labeled accuracy. At the same time, a range of different syntactic representations have been put to use in treebank annotation projects and there have been studies measuring various aspects of the ”lea...
متن کاملExploring Gene Expression Data with Class Scores
We address a commonly asked question about gene expression data sets: "What functional classes of genes are most interesting in the data?" In the methods we present, expression data is partitioned into classes based on existing annotation schemes. Each class is then given three separately derived "interest" scores. The first score is based on an assessment of the statistical significance of gen...
متن کاملThe Design of Syntactic Annotation Levels in the National Corpus of Polish
This paper presents the procedure of the syntactic annotation of the National Corpus of Polish. Syntactic annotation consists here of shallow parsing and manual post-editing of the results by annotators. The description concentrates on the delimitation of syntactic words and groups, as well as on problems encountered during the annotation process.
متن کامل